Multiple Generation


Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Neural Information Processing Systems

Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing k drafts to the user requires running an expensive language model k times. To alleviate the computation cost of running k inference passes, we propose Superposed Decoding, a new decoding algorithm that generates k drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the k drafts as input to the next decoding step of the language model.
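The core trick the abstract describes is feeding a superposition (a weighted combination) of the k drafts' most recent token embeddings into the next decoding step. A minimal sketch of that mixing step, assuming a uniform weighting and illustrative names (`superpose_embeddings` and its arguments are not the paper's API):

```python
import numpy as np

def superpose_embeddings(draft_token_embeddings, weights=None):
    """Mix the latest token embedding of each of k drafts into one input.

    draft_token_embeddings: shape (k, d) -- one row per draft, holding the
    embedding of that draft's most recent token. Returns a single (d,)
    vector to feed as the next decoding step's input.
    """
    k, _ = draft_token_embeddings.shape
    if weights is None:
        weights = np.full(k, 1.0 / k)  # uniform superposition by default
    weights = weights / weights.sum()  # normalize so the mixture is convex
    return weights @ draft_token_embeddings

# Toy example: k=3 drafts, embedding dimension d=4
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
mixed = superpose_embeddings(emb)  # one input vector instead of three
```

Because the model consumes one mixed vector per step instead of k separate inputs, the transformer forward pass runs once regardless of how many drafts are being tracked.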


Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis

Zhang, Wenbo, Cai, Hengrui, Chen, Wenyu

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs, as they provide a comprehensive picture of a model's strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of the estimated benchmark score and reduces variance. We also introduce $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insight into individual prompts. Additionally, we create a data map that visualizes prompt difficulty and semantics, enabling error detection and quality control in benchmark construction.
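The prompt-level score $\mathbb P(\text{correct})$ is described as a correct ratio over multiple generations. A minimal sketch of that estimator, with a simulated grader standing in for "sample one generation and grade it" (`is_correct` and `estimate_p_correct` are illustrative names, not the paper's code):

```python
import random

def estimate_p_correct(is_correct, prompt, n_generations, seed=0):
    """Estimate a prompt's correct ratio from n independent generations.

    `is_correct(prompt, rng)` stands in for sampling one generation from
    the LLM and grading it; here it is simulated, not a real model call.
    """
    rng = random.Random(seed)
    hits = sum(is_correct(prompt, rng) for _ in range(n_generations))
    return hits / n_generations

# Toy model: a prompt the (simulated) LLM answers correctly 70% of the time.
p_hat = estimate_p_correct(lambda prompt, rng: rng.random() < 0.7,
                           prompt="2+2=?", n_generations=1000)
```

A single sample would report this prompt as simply "correct" or "incorrect"; averaging n generations yields an estimate whose variance shrinks as $p(1-p)/n$, which is the variance-reduction benefit the abstract claims.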


Cost-Effective Hallucination Detection for LLMs

Valentin, Simon, Fu, Jinmiao, Detommaso, Gianluca, Xu, Shaoyuan, Zappella, Giovanni, Wang, Bryan

arXiv.org Machine Learning

Despite their impressive capabilities, large language models (LLMs) can be prone to generating hallucinations -- undesirable outputs that are incorrect, unfaithful, or inconsistent with respect to the inputs (or the output itself) [1]. These unreliable behaviors pose significant risks for adopting LLMs in real-world applications. Challenges in detecting hallucinations lie, among other things, in hallucinations taking different forms, being context-dependent and sometimes being in conflict with other desirable properties of generated text [2, 3]. Hallucinations may be harmless in some contexts, but can be undesired or potentially dangerous in other applications (e.g., erroneous medical advice). Detecting and quantifying hallucination risk is thus a critical capability to enable safe applications of LLMs and improve generated outputs. Prior work has proposed various approaches for detecting and mitigating hallucinations in LLM-generated outputs, including verifying faithfulness to inputs [4], assessing internal coherence [5], consulting external knowledge sources [6], and quantifying model uncertainty [2, 3, 7, 8]. However, deploying these methods in production settings is far from trivial due to several challenges: First, there is limited comparative evaluation illuminating how different detection methods perform. Second, existing approaches for detecting hallucinations differ greatly in their computational demands, and guidelines are lacking on cost-effectiveness trade-offs to inform method selection for real-world applications with constraints. Third, hallucination detection in the real world often requires careful consideration of risks and false positive/negative trade-offs, requiring methods to provide well-calibrated probability scores.
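Of the detection families listed, uncertainty-based methods [2, 3, 7, 8] can be sketched very simply: sample several answers to the same prompt and score disagreement among them, on the assumption that a model hallucinating tends to answer inconsistently. The function below is an illustrative sketch of that idea, not any method evaluated in the paper:

```python
from collections import Counter

def inconsistency_score(sampled_answers):
    """Crude hallucination-risk proxy: 1 - frequency of the modal answer.

    0.0 means all samples agree (low risk); values near 1.0 mean the
    samples are spread thin across many distinct answers (high risk).
    """
    counts = Counter(sampled_answers)
    modal_freq = counts.most_common(1)[0][1] / len(sampled_answers)
    return 1.0 - modal_freq

low_risk = inconsistency_score(["Paris", "Paris", "Paris", "Paris"])
high_risk = inconsistency_score(["Paris", "Lyon", "Nice", "Paris"])
```

This also illustrates the cost axis the abstract raises: such a detector needs several extra generations per query, so its score quality must be weighed against that multiplied inference cost, and the raw disagreement score would still need calibration before it can serve as a probability.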


Epigenomics Now

Communications of the ACM

For nearly a quarter-century, we have had a (mostly) complete listing of the human genome, the three-billion "letter" sequence of DNA, most of which is the same for all of us. This reference copy makes it much easier for scientists to understand biological processes and to identify the individual variations, such as mutations, that contribute to disease. Despite its central role and its extreme usefulness, however, the genome's impact on healthcare has been smaller than many proponents had hoped. Part of the reason is that while most of the cells in the human body carry identical DNA, the biological activity of different regions varies widely over time and between different tissues. It is these differences in gene expression that orchestrate the intricate development of tissues and the unique features of various cell types, as well as much of the misbehavior of cells in disease.


Learning a Hierarchical Planner from Humans in Multiple Generations

Cano, Leonardo Hernandez, Pu, Yewen, Hawkins, Robert D., Tenenbaum, Josh, Solar-Lezama, Armando

arXiv.org Artificial Intelligence

A typical way in which a machine acquires knowledge from humans is by programming. Compared to learning from demonstrations or experiences, programmatic learning allows the machine to acquire a novel skill as soon as the program is written, and, by building a library of programs, a machine can quickly learn how to perform complex tasks. However, because programs often take their execution contexts for granted, they are brittle when those contexts change, making it difficult to adapt complex programs to new contexts. We present natural programming, a library learning system that combines programmatic learning with a hierarchical planner. Natural programming maintains a library of decompositions, each consisting of a goal, a linguistic description of how that goal decomposes into sub-goals, and a concrete instance of its decomposition into sub-goals. A user teaches the system via curriculum building, identifying a challenging yet not impossible goal along with linguistic hints on how the goal may be decomposed into sub-goals. The system solves for the goal via hierarchical planning, using the linguistic hints to guide its probability distribution over candidate plans. The system learns from this interaction by adding decompositions newly found during the successful search to its library. Simulated studies and a human experiment (n=360) in a controlled environment demonstrate that natural programming can robustly compose programs learned from different users and contexts, adapting faster and solving more complex tasks than programmatic baselines.
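The library the abstract describes stores decompositions, each pairing a goal with a linguistic description and one concrete split into sub-goals, and grows as successful searches add new entries. A minimal data-structure sketch under those assumptions (all names here are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Decomposition:
    """One library entry, per the abstract: a goal, a linguistic description
    of how it decomposes, and a concrete instance of that decomposition."""
    goal: str
    description: str          # linguistic hint, e.g. from the user's curriculum
    subgoals: list            # concrete sub-goals found by the planner

@dataclass
class Library:
    entries: list = field(default_factory=list)

    def add(self, decomposition):
        # Learning step: a successful hierarchical search contributes
        # its newly found decomposition back to the library.
        self.entries.append(decomposition)

    def lookup(self, goal):
        # Retrieval step: reuse stored decompositions whose goal matches.
        return [d for d in self.entries if d.goal == goal]

lib = Library()
lib.add(Decomposition(goal="build house",
                      description="lay the foundation, then frame, then roof",
                      subgoals=["lay foundation", "build frame", "add roof"]))
matches = lib.lookup("build house")
```

Because entries from different users and contexts share this one format, composing them is just retrieving and chaining decompositions, which is the reuse property the experiments test.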